The Church and Hanks reading shows how interesting semantics can be extracted by looking at very simple patterns. For instance, if we look at what gets drunk (the objects of the verb drink), we can automatically acquire a list of beverages. Similarly, in a text about mythology, looking at the subjects of certain informative verbs might let us group all the gods' names together by seeing who does the blessing and smiting. More generally, the common objects of verbs (or, in some cases, their subjects) give us another piece of evidence for grouping similar words together.
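To make the idea concrete before diving into the assignment, here is a minimal sketch (not part of the assignment) that approximates the objects of drink in the tagged Brown corpus by taking the first noun that follows the verb. The verb-form list and the "first following noun" heuristic are my own crude assumptions, and the code assumes the same Python 2 / older NLTK environment used in the rest of this notebook.
import nltk
from nltk.corpus import brown
# Crude object detection: for each occurrence of a "drink" form tagged as a verb,
# take the first noun that follows it in the same sentence.
drink_forms = set(["drink", "drinks", "drank", "drunk", "drinking"])
drink_objects = []
for sent in brown.tagged_sents():
    for i, (word, tag) in enumerate(sent):
        if word.lower() in drink_forms and tag.startswith("VB"):
            for obj, obj_tag in sent[i + 1:]:
                if obj_tag.startswith("NN"):
                    drink_objects.append(obj.lower())
                    break
print nltk.FreqDist(drink_objects).items()[:15]
If the heuristic works at all, the most frequent "objects" should be dominated by beverage words, which is exactly the Church and Hanks observation.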
Find frequent verbs: Using your tagged collection from the previous assignment, first pull out verbs and then rank by frequency (if you like, you might use WordNet's morphy() to normalize them into their lemma form, but this is not required). Print out the top 40 most frequent verbs and take a look at them:
In [1]:
import nltk
import re
from nltk.corpus import brown
In [2]:
import debates_util
In [3]:
debates = nltk.clean_html(debates_util.load_pres_debates().raw())
In [4]:
"""
Returns the corpus for the presidential debates with words tokenized by regex below.
"""
token_regex= """(?x)
# taken from the NLTK book example
([A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \w+(-\w+)* # words with optional internal hyphens
| \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
| [][.,;"'?():_`-] # these are separate tokens; '-' is last so it is literal
"""
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
In [5]:
tokens = nltk.regexp_tokenize(debates, token_regex)
In [6]:
def build_backoff_tagger(train_sents):
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    return t2
tagger = build_backoff_tagger(brown.tagged_sents())
In [7]:
tags = tagger.tag(tokens)
In [8]:
sents = list(sent_tokenizer.sentences_from_tokens(tokens))
In [9]:
v_fd = nltk.FreqDist([t[0] for t in tags if re.match(r"V.*", t[1])])
v_fd.items()[50:100]
Out[9]:
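The prompt above mentions that WordNet's morphy() can optionally be used to normalize the verbs to their lemma form before counting. Here is a hedged sketch of how that might look; it reuses tags and re from the cells above, the fallback to the surface form when morphy() returns None is my own choice, and verb_lemma and v_fd_lemmas are hypothetical names.
from nltk.corpus import wordnet as wn
def verb_lemma(word):
    # morphy() returns None when it cannot find a lemma; fall back to the surface form
    lemma = wn.morphy(word.lower(), wn.VERB)
    return lemma if lemma is not None else word.lower()
# Same verb filter as above, but counting lemmas instead of surface forms.
v_fd_lemmas = nltk.FreqDist([verb_lemma(t[0]) for t in tags if re.match(r"V.*", t[1])])
v_fd_lemmas.items()[:40]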
Pick out 2 interesting verbs: Next, manually pick out two verbs that look interesting to you and examine them in detail. Try to pick verbs whose objects will be interesting and will form a pattern of some kind. Find all the sentences in your corpus that contain these verbs.
In [10]:
defend_sents = [s for s in sents if "defend" in s]
[" ".join(s) for s in defend_sents[0:20]]
Out[10]:
In [11]:
help_sents = [s for s in sents if "help" in s]
[" ".join(s) for s in help_sents[0:20]]
Out[11]:
Find common objects: Now write a chunker to find the simple noun phrase objects of these verbs and see if they tell you anything interesting about your collection. Don't worry about making the noun phrases perfect; you can use the chunker from the first part of this homework if you like. Print out the common noun phrases and take a look. Write the code below, show some of the output, and then reflect on that output in a few sentences.
In [12]:
np_chunker = r"""
VPHRASE: {<V.*><DT|AT|P.*|JJ.*|IN>*<NN.*>+}
"""
np_parser = nltk.RegexpParser(np_chunker)
In [13]:
t_defend = [tagger.tag(s) for s in defend_sents]
t_help = [tagger.tag(s) for s in help_sents]
In [14]:
c_defend = [np_parser.parse(s) for s in t_defend]
c_help = [np_parser.parse(s) for s in t_help]
In [15]:
fd_defend = nltk.FreqDist([" ".join(w[0] for w in sub[1:]) for t in c_defend for sub in t.subtrees() if sub.node=="VPHRASE" and sub[0][0].lower()=="defend"])
fd_defend.items()[0:10]
Out[15]:
In [16]:
fd_help = nltk.FreqDist([" ".join(w[0] for w in sub[1:]) for t in c_help for sub in t.subtrees() if sub.node=="VPHRASE" and sub[0][0].lower()=="help"])
fd_help.items()[0:10]
Out[16]:
These are interesting results, confirming many of my vague ideas about the things politicians have discussed in the past. A further modification would be to try to group noun phrases that refer to the same thing, for example "this country" and "this nation" (a rough sketch of this follows below).
I would also like to split this up by debate to see how the nouns change from election to election.
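A rough sketch of the grouping idea, as promised above: collapse noun phrases that refer to the same thing before counting. The alias map here is hand-made and purely illustrative (np_aliases, normalize_np, and grouped are hypothetical names); a more serious version might map phrase heads to WordNet synsets instead.
# Hand-made, illustrative alias map: phrases on the left are folded into the phrase on the right.
np_aliases = {
    "this nation": "this country",
    "the nation": "this country",
    "the country": "this country",
}
def normalize_np(np):
    return np_aliases.get(np.lower(), np.lower())
# Re-count the objects of "defend" with equivalent phrases merged.
grouped = {}
for np, count in fd_defend.items():
    key = normalize_np(np)
    grouped[key] = grouped.get(key, 0) + count
sorted(grouped.items(), key=lambda kv: kv[1], reverse=True)[:10]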
In [17]:
from nltk.corpus import wordnet as wn
from nltk.corpus import brown
from nltk.corpus import stopwords
This code first pulls out the most frequent words from a section of the Brown corpus after removing stop words. It lowercases everything, but it should really do much smarter things with tokenization, phrases, and so on.
In [18]:
def preprocess_terms():
    # select a subcorpus of brown to experiment with
    words = [word.lower() for word in brown.words(categories="science_fiction") if word.lower() not in stopwords.words('english')]
    # count up the words
    fd = nltk.FreqDist(words)
    # show some sample words
    print ' '.join(fd.keys()[100:150])
    return fd
fd = preprocess_terms()
Then it makes a very naive guess at which words are most important. This is where some real term weighting should take place (a rough sketch of one option follows the cell below).
In [19]:
def find_important_terms(fd):
    important_words = fd.keys()[100:500]
    return important_words
important_terms = find_important_terms(fd)
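As noted above, this is where real term weighting belongs. A hedged sketch of one option: score each word by its frequency in the science_fiction subcorpus relative to its frequency in Brown as a whole, a crude tf-idf-like ratio. It reuses fd from preprocess_terms() above; all_brown_fd, distinctiveness, important_terms_weighted, the add-one smoothing, and the cutoff of 400 terms are my own arbitrary choices.
# Background counts over all of Brown, lowercased to match fd.
all_brown_fd = nltk.FreqDist(w.lower() for w in brown.words())
def distinctiveness(word):
    # how much more often the word appears in the subcorpus than in Brown overall;
    # the +1 just keeps the denominator positive
    return fd[word] / float(all_brown_fd[word] + 1)
important_terms_weighted = sorted(fd.keys(), key=distinctiveness, reverse=True)[:400]
print ' '.join(important_terms_weighted[:50])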
The code below is a very crude way to see what the most common "topics" are among the "important" words, according to WordNet. It does this by looking at the immediate hypernyms of every sense of a wordform, for those wordforms that WordNet knows as nouns. This is problematic because many of these senses will be incorrect, and the hypernym often elides the specific meaning of the word; but if you compare, say, romance to science_fiction in Brown, you do see differences in the results.
In [20]:
# Count the direct hypernyms for every sense of each wordform.
# This is very crude. It should convert the wordform to a lemma, and should
# be smarter about selecting important words and finding two-word phrases, etc.
# Nonetheless, you get interesting differences between, say, scifi and romance.
def categories_from_hypernyms(termlist):
    hypterms = []
    for term in termlist:                   # for each term
        s = wn.synsets(term.lower(), 'n')   # get its nominal synsets
        for syn in s:                       # for each synset
            for hyp in syn.hypernyms():     # it has a list of hypernyms
                hypterms = hypterms + [hyp.name]  # extract the hypernym name and add it to the list
    hypfd = nltk.FreqDist(hypterms)
    print "Show most frequent hypernym results"
    return [(count, name, wn.synset(name).definition) for (name, count) in hypfd.items()[:25]]
categories_from_hypernyms(important_terms)
Out[20]:
Here is the question: Modify this code in some way to do a better job of using WordNet to summarize terms. You can trim senses in a better way, or traverse hypernyms differently. You don't have to use hypernyms; you can use any WordNet relations you like, or choose your terms in another way. You can also use other parts of speech if you like.
In [39]:
def get_hypernyms(synsets, max_distance=100):
    """
    Takes a list of synsets (as generated by wn.synsets) and returns the set of all
    their hypernyms that appear at position <= max_distance (counted from the root)
    on any hypernym path; the synsets themselves are excluded.
    """
    hypernyms = set()
    for synset in synsets:
        for path in synset.hypernym_paths():
            hypernyms.update([h for idx, h in enumerate(path) if h != synset and idx <= max_distance])
    return hypernyms

def fd_hypernyms(fd, depth=None, min_depth=0, max_distance=100, pos=None):
    """
    Takes a frequency distribution and analyzes the hypernyms of the wordforms contained therein.
    Returns a list of (hypernym synset, accumulated relative frequency) pairs, sorted by weight.
    fd - frequency distribution
    depth - how many of the most frequent wordforms in fd to consider
    min_depth - only include hypernyms whose depth in WordNet is at least this value.
                Unintuitively, max_depth() is used to calculate the depth of a synset.
    max_distance - only keep hypernyms that appear within this many steps of the root
                   on a hypernym path; together with min_depth this selects hypernyms
                   at roughly that level of generality.
    pos - part of speech to limit synsets to
    """
    hypernyms = {}
    for wf in fd.keys()[0:depth]:
        freq = fd.freq(wf)
        hset = get_hypernyms(wn.synsets(wf, pos=pos), max_distance=max_distance)
        for h in hset:
            if h.max_depth() >= min_depth:
                if h in hypernyms:
                    hypernyms[h] += freq
                else:
                    hypernyms[h] = freq
    hlist = hypernyms.items()
    hlist.sort(key=lambda s: s[1], reverse=True)
    return hlist

def concept_printer(concepts, n=20):
    "Prints the first n concepts in a concept list generated by fd_hypernyms."
    print "{:<20} | {:<12} | {}".format("Concept", "Concept Freq", "Definition")
    print "===================================================================="
    for s in concepts[0:n]:
        print "{:<20} | {:<12.3%} | {}".format(s[0].lemma_names[0], s[1], s[0].definition)
In [78]:
concepts = fd_hypernyms(fd, depth=500, max_distance=4, min_depth=4)
concept_printer(concepts)